# A tibble: 6 × 61
`Country Name` `Country Code` `1961` `1962` `1963` `1964` `1965` `1966` `1967`
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Aruba ABW NA NA NA NA NA NA NA
2 Afghanistan AFG NA NA NA NA NA NA NA
3 Angola AGO NA NA NA NA NA NA NA
4 Albania ALB NA NA NA NA NA NA NA
5 Andorra AND NA NA NA NA NA NA NA
6 Arab World ARB NA NA NA NA NA NA NA
# ℹ 52 more variables: `1968` <dbl>, `1969` <dbl>, `1970` <dbl>, `1971` <dbl>,
# `1972` <dbl>, `1973` <dbl>, `1974` <dbl>, `1975` <dbl>, `1976` <dbl>,
# `1977` <dbl>, `1978` <dbl>, `1979` <dbl>, `1980` <dbl>, `1981` <dbl>,
# `1982` <dbl>, `1983` <dbl>, `1984` <dbl>, `1985` <dbl>, `1986` <dbl>,
# `1987` <dbl>, `1988` <dbl>, `1989` <dbl>, `1990` <dbl>, `1991` <dbl>,
# `1992` <dbl>, `1993` <dbl>, `1994` <dbl>, `1995` <dbl>, `1996` <dbl>,
# `1997` <dbl>, `1998` <dbl>, `1999` <dbl>, `2000` <dbl>, `2001` <dbl>, …
GDP growth data: long layout
Here’s an alternative layout for the annual GDP growth data:
# A tibble: 5 × 4
`Country Name` `Country Code` year GDP
<chr> <chr> <chr> <dbl>
1 Afghanistan AFG 1961 NA
2 Albania ALB 1961 NA
3 Algeria DZA 1961 -13.6
4 American Samoa ASM 1961 NA
5 Andorra AND 1961 NA
SB weather data: long layouts
A third example: daily minimum and maximum temperatures recorded at Santa Barbara Municipal Airport from January 2021 through March 2021.
weather1 <-read_csv('data/sb_weather.csv')# convert to date format specifying month/day/year formatweather1
# A tibble: 88 × 5
STATION NAME DATE TMAX TMIN
<chr> <chr> <chr> <dbl> <dbl>
1 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, CA US 1/1/2021 65 37
2 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, CA US 1/2/2021 62 38
3 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, CA US 1/3/2021 60 42
4 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, CA US 1/4/2021 72 43
5 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, CA US 1/5/2021 61 40
6 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, CA US 1/6/2021 71 39
7 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, CA US 1/7/2021 73 38
8 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, CA US 1/8/2021 79 36
9 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, CA US 1/9/2021 71 39
10 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, CA US 1/10/2021 67 37
# ℹ 78 more rows
weather1 <- weather1 |>mutate(DATE =mdy(DATE)) ## from the lubridate packageweather1
# A tibble: 88 × 5
STATION NAME DATE TMAX TMIN
<chr> <chr> <date> <dbl> <dbl>
1 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, CA US 2021-01-01 65 37
2 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, CA US 2021-01-02 62 38
3 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, CA US 2021-01-03 60 42
4 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, CA US 2021-01-04 72 43
5 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, CA US 2021-01-05 61 40
6 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, CA US 2021-01-06 71 39
7 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, CA US 2021-01-07 73 38
8 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, CA US 2021-01-08 79 36
9 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, CA US 2021-01-09 71 39
10 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, CA US 2021-01-10 67 37
# ℹ 78 more rows
A final example: United Nations country development data organized into different tables according to variable type.
undev1 <-read_csv('data/hdi3.csv', na ='..') |>select(-hdi_rank)undev2 <-read_csv('data/hdi2.csv', na ='..') |>select(-one_of(c('hdi_rank', 'maternal_mortality')))
Return to one of the examples and review the different layouts.
If you had to pick one layout, which would you choose?
Why would you choose that one?
Aesthetic preference?
“Just makes sense this way”?
“This way is better because…”?
Can you envision advantages or disadvantages to different layouts?
Few organizational constraints
It’s surprisingly difficult to articulate reasons why one layout might be preferable to another.
Possibly for this reason, most data are stored in a layout that made intuitive sense to someone responsible for data management or collection at some point in time.
Usually the choice of layout isn’t principled
Idiosyncratic: two people are likely to make different choices
As a result:
Few widely used conventions
Lots of variability ‘in the wild’
Datasets are often organized in bizarre ways
Consequences for the data scientist
Because of the wide range of possible layouts for a dataset, and the variety of choices that are made about how to store data, data scientists are constantly faced with determining how best to reorganize datasets in a way that facilitates exploration and analysis.
Broadly, this involves two interdependent choices:
Choice of representation: how to encode information.
Example: parse dates as ‘MM/DD/YYYY’ (one variable) or ‘MM’, ‘DD’, ‘YYYY’ (three variables)?
Example: use values 1, 2, 3 or ‘low’, ‘med’, ‘high’?
Example: name variables ‘question1’, ‘question2’, …, or ‘age’, ‘income’, …?
Choice of form: how to display information
Example: wide table or long table?
Example: one table or many?
Remedy: the tidy data standard
Instead of addressing these challenges – choice of form and representation – anew every single time, it is immensely helpful to have a set of organizational principles to standardize the process of rearranging data.
The tidy data standard is a principled way of organizing data values. It has two main advantages:
Facilitates workflow by establishing a consistent dataset structure.
Principles are designed to make transformation, exploration, visualization, and modeling easy.
This is a pretty intuitive idea. Many (most?) other things are easier when they’re thoughtfully organized.
Principles of tidy data
Tidy data matches semantics with structure
Data semantics: observations, variables, units
Tabular structure: rows and columns
The tidy data standard
Common messes
Tidying operations
Matching semantics with structure
“Tidying your data means storing it in a consistent form that matches the semantics of the dataset with the way it is stored. In brief, when your data is tidy, each column is a variable, and each row is an observation. Tidy data is important because the consistent structure lets you focus your struggle on questions about the data, not fighting to get the data into the right form for different functions.” Wickham and Grolemund, R for Data Science, 2017.
Matching semantics with structure
A dataset is a collection of values.
semantics: meaning
data semantics refers to the meaning of each value
structure: form
data structure refers to how values are arranged
The tidy standard: data semantics \(\longleftrightarrow\) data structure
Semantics: units, variables, and observations
Data semantics refers to the meaning of values. To introduce some general vocabulary, each value is
an observation
of a variable
taken on a unit.
To be precise:
An observational unit is the entity measured.
A variable is an attribute measured on each unit.
An observation is a collection of measurements taken on one unit.
Identifying units, variables, and observations
Let’s do an example. In the GDP growth data:
Term
Definition
Example
Observational units
Entity measured
Countries
Variables
Attributes measured
Year, GDP growth
Observations
Set of measurements per unit
Annual records
# third recordgdp2[3,]
# A tibble: 1 × 4
`Country Name` `Country Code` year GDP
<chr> <chr> <chr> <dbl>
1 Algeria DZA 1961 -13.6
So, below, -13.605441 (variable) in 1961 (variable) is a record (observation) for Algeria (unit).
Identifying units, variables, and observations
In the weather data:
Term
Definition
Instance
Observational unit
Entity measured
SB airport weather station (location)
Variables
Attributes measured
Min temp, max temp, date, station info
Observations
Set of measurements per unit
Daily records
# first recordweather1[1, ]
# A tibble: 1 × 5
STATION NAME DATE TMAX TMIN
<chr> <chr> <date> <dbl> <dbl>
1 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, CA US 2021-01-01 65 37
For example: 65 degrees Farenheit is the maximum temperature (one variable) recorded on a day (one observation) at the SB airport weather station (unit).
Data structure
Data structure refers to the form in which it is stored.
In this context, that means a tabular arrangement of a dataset comprising:
rows
columns
As we saw, there are multiple structures available to represent any dataset.
The tidy standard
The tidy standard consists in matching semantics and structure.
We can now make that precise. A dataset conforming to the tidy standard is organized so that:
Each variable is a column.
Each observation is a row.
Each table contains measurements on only one type of observational unit.
Tidy or messy?
Let’s revisit some of our examples of multiple layouts, starting with gdp1.
head(gdp1, n=3)
# A tibble: 3 × 61
`Country Name` `Country Code` `1961` `1962` `1963` `1964` `1965` `1966` `1967`
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Aruba ABW NA NA NA NA NA NA NA
2 Afghanistan AFG NA NA NA NA NA NA NA
3 Angola AGO NA NA NA NA NA NA NA
# ℹ 52 more variables: `1968` <dbl>, `1969` <dbl>, `1970` <dbl>, `1971` <dbl>,
# `1972` <dbl>, `1973` <dbl>, `1974` <dbl>, `1975` <dbl>, `1976` <dbl>,
# `1977` <dbl>, `1978` <dbl>, `1979` <dbl>, `1980` <dbl>, `1981` <dbl>,
# `1982` <dbl>, `1983` <dbl>, `1984` <dbl>, `1985` <dbl>, `1986` <dbl>,
# `1987` <dbl>, `1988` <dbl>, `1989` <dbl>, `1990` <dbl>, `1991` <dbl>,
# `1992` <dbl>, `1993` <dbl>, `1994` <dbl>, `1995` <dbl>, `1996` <dbl>,
# `1997` <dbl>, `1998` <dbl>, `1999` <dbl>, `2000` <dbl>, `2001` <dbl>, …
Tidy or messy?
We can compare the semantics and structure for alignment:
Semantics
Observations
Annual records
Variables
GDP growth and year
Observational units
Countries
Structure
Rows
Countries
Columns
Value of year
Tables
Just one
Rules 1 and 2 are violated, since column names are values (of year), not variables. Not tidy.
Tidy or messy?
In gdp2:
head(gdp2, n=4)
# A tibble: 4 × 4
`Country Name` `Country Code` year GDP
<chr> <chr> <chr> <dbl>
1 Afghanistan AFG 1961 NA
2 Albania ALB 1961 NA
3 Algeria DZA 1961 -13.6
4 American Samoa ASM 1961 NA
Tidy or messy?
Comparison of semantics and structure:
Semantics
Observations
Annual records
Variables
GDP growth and year
Observational units
Countries
Structure
Rows
Annual records
Columns
GDP growth and year
Tables
Just one
All three rules are met: rows are observations, columns are variables, and there’s one unit type and one table. Tidy.
Tidy or messy?
head(weather2, n=4)
# A tibble: 4 × 8
STATION NAME DATE TMAX TMIN DAY MONTH YEAR
<chr> <chr> <date> <dbl> <dbl> <int> <dbl> <dbl>
1 USW00023190 SANTA BARBARA MUNICIPAL … 2021-01-01 65 37 1 1 2021
2 USW00023190 SANTA BARBARA MUNICIPAL … 2021-01-02 62 38 2 1 2021
3 USW00023190 SANTA BARBARA MUNICIPAL … 2021-01-03 60 42 3 1 2021
4 USW00023190 SANTA BARBARA MUNICIPAL … 2021-01-04 72 43 4 1 2021
Try this one on your own. Then compare with your neighbor.
Are the observational units the same or different?
Based on your answer above, is the data tidy or not?
Common messes
These examples illustrate some of the most common kinds of messiness:
Columns are values, not variables
GDP data gdp1: columns are 1961, 1962, …
Multiple variables are stored in one column
Mammal data mammal2: weight column contains both body and brain weights
Variables or values are stored in rows and columns
Weather data weather3: date values are stored in rows and columns, each column contains both min and max temperatures
Measurements on one type of observational unit are divided into multiple tables.
UN development data: undev1 stores population variables; undev2 stores gender-related variables.
Tidying operations
These common messes can be cleaned up by some simple operations:
pivot_longer
reshape a dataframe from wide to long format
pivot_wider
reshape a dataframe from long to wide format
joins
combine two dataframes row-wise by matching the values of certain columns
pivot_longer
Pivoting resolves the problem of having values stored as columns (common mess 1).
pivot_longer
To illustrate with gdp1:
head(gdp1, n=3)
# A tibble: 3 × 61
`Country Name` `Country Code` `1961` `1962` `1963` `1964` `1965` `1966` `1967`
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Aruba ABW NA NA NA NA NA NA NA
2 Afghanistan AFG NA NA NA NA NA NA NA
3 Angola AGO NA NA NA NA NA NA NA
# ℹ 52 more variables: `1968` <dbl>, `1969` <dbl>, `1970` <dbl>, `1971` <dbl>,
# `1972` <dbl>, `1973` <dbl>, `1974` <dbl>, `1975` <dbl>, `1976` <dbl>,
# `1977` <dbl>, `1978` <dbl>, `1979` <dbl>, `1980` <dbl>, `1981` <dbl>,
# `1982` <dbl>, `1983` <dbl>, `1984` <dbl>, `1985` <dbl>, `1986` <dbl>,
# `1987` <dbl>, `1988` <dbl>, `1989` <dbl>, `1990` <dbl>, `1991` <dbl>,
# `1992` <dbl>, `1993` <dbl>, `1994` <dbl>, `1995` <dbl>, `1996` <dbl>,
# `1997` <dbl>, `1998` <dbl>, `1999` <dbl>, `2000` <dbl>, `2001` <dbl>, …
gdp1 |>pivot_longer(cols=3:61,names_to="Year",values_to="GDP") |>arrange(Year, `Country Name`) # sort by year first then country
# A tibble: 15,576 × 4
`Country Name` `Country Code` Year GDP
<chr> <chr> <chr> <dbl>
1 Afghanistan AFG 1961 NA
2 Albania ALB 1961 NA
3 Algeria DZA 1961 -13.6
4 American Samoa ASM 1961 NA
5 Andorra AND 1961 NA
6 Angola AGO 1961 NA
7 Antigua and Barbuda ATG 1961 NA
8 Arab World ARB 1961 NA
9 Argentina ARG 1961 5.43
10 Armenia ARM 1961 NA
# ℹ 15,566 more rows
Pivoting to a wider format
Pivoting to a wider format resolves the issue of having multiple variables stored in one column (common mess 2).
Pivot wider
For example, the mammal2 layout can be put in tidier form with pivot_longer:
mammal2 |>pivot_wider(names_from ='measurement', # which variable(s) do you want to send to new column names?values_from ='weight'# which variable(s) do you want to use to populate the new columns?)
Common mess 3 is a combination of messes 1 and 2: values or variables are stored in both rows and columns. Pivoting and melting in sequence can usually fix this. weather3 illustrates this issue:
## First move date columns into a variableweather3_long <- weather3 |>pivot_longer(cols =6:36,names_to ='day',values_to ='temp' )weather3_long
# A tibble: 186 × 7
STATION NAME MONTH YEAR Type day temp
<chr> <chr> <dbl> <dbl> <chr> <chr> <dbl>
1 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C… 1 2021 TMAX 1 65
2 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C… 1 2021 TMAX 2 62
3 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C… 1 2021 TMAX 3 60
4 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C… 1 2021 TMAX 4 72
5 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C… 1 2021 TMAX 5 61
6 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C… 1 2021 TMAX 6 71
7 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C… 1 2021 TMAX 7 73
8 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C… 1 2021 TMAX 8 79
9 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C… 1 2021 TMAX 9 71
10 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C… 1 2021 TMAX 10 67
# ℹ 176 more rows
Pivot longer and wider
Then pivot_wider…
## First move date columns into a variableweather3_tidy <- weather3_long |>pivot_wider(names_from = Type, values_from = temp)weather3_tidy
# A tibble: 93 × 7
STATION NAME MONTH YEAR day TMAX TMIN
<chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
1 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C… 1 2021 1 65 37
2 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C… 1 2021 2 62 38
3 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C… 1 2021 3 60 42
4 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C… 1 2021 4 72 43
5 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C… 1 2021 5 61 40
6 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C… 1 2021 6 71 39
7 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C… 1 2021 7 73 38
8 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C… 1 2021 8 79 36
9 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C… 1 2021 9 71 39
10 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C… 1 2021 10 67 37
# ℹ 83 more rows
Pivot longer and wider
All in one chunk
## First move date columns into a variableweather3 |>pivot_longer(cols =6:36,names_to ='day',values_to ='temp' ) |>pivot_wider(names_from = Type, values_from = temp)
# A tibble: 93 × 7
STATION NAME MONTH YEAR day TMAX TMIN
<chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
1 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C… 1 2021 1 65 37
2 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C… 1 2021 2 62 38
3 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C… 1 2021 3 60 42
4 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C… 1 2021 4 72 43
5 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C… 1 2021 5 61 40
6 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C… 1 2021 6 71 39
7 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C… 1 2021 7 73 38
8 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C… 1 2021 8 79 36
9 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C… 1 2021 9 71 39
10 USW00023190 SANTA BARBARA MUNICIPAL AIRPORT, C… 1 2021 10 67 37
# ℹ 83 more rows
Pivoting
Joins
Joining resolves the issue of storing observations or variables on one unit type in multiple tables (mess 4). The basic idea is to combine by matching rows.
However, there are a number of different joining rules (corresponding to SQL joins).
Join
The code below combines columns in each table by matching rows based on country.